Abstract - Nowadays, the datapaths of modern microprocessors
reach their limits by using static instruction sets. A way out
of these limitations is a dynamically reconfigurable processor
datapath extension achieved by integrating traditional static
datapaths with the coarse-grain dynamically reconfigurable
XPP architecture (eXtreme Processing Platform). Therefore, a
loosely asynchronous coupling mechanism of the corresponding
datapath units has been developed and integrated onto a CMOS
0.13 µm standard cell technology from UMC. Here the SPARC
compatible LEON RISC processor is used, whereas its static
pipelined instruction datapath has been extended to be
configured and personalized for specific applications. This
allows a varied and efficient use, e.g. in streaming
application domains like MPEG-4, digital filters, mobile
communication modulation, etc. The chosen coupling technique
allows asynchronous concurrency of the configured compound
instructions, which are integrated into the programming and
compilation environment of the LEON processor.



Introduction 

The limitations of conventional processors are becoming more
and more evident. The growing importance of stream-based
applications makes coarse-grain dynamically reconfigurable
architectures an attractive alternative [3], [4], [7]. They
combine the performance of ASICs, which are very risky and
expensive (development and mask costs), with the flexibility
of traditional processors [5].

In spite of the possibilities we have today in VLSI
development, the basic concepts of microprocessor
architectures are the same as 20 years ago. The main
processing unit of modern conventional microprocessors, the
datapath, follows the same style guidelines as its
predecessors. Although the development of pipelined
architectures and superscalar concepts increases the
performance of a modern microprocessor and allows higher
frequency rates, the main concept of a static datapath
remains. Therefore, each operation is a composition of the
basic instructions that the processor provides. The benefit of
the processor concept lies in its ability to execute strongly
control-dominated applications. Data- or stream-oriented
applications are not well suited to this environment.
Sequential instruction execution is not the right target for
that kind of application and needs high bandwidth because of
the permanent retransmission of instructions and data from
and to memory. This handicap is often eased by using caches
in various stages. A sequential interconnection of filters,
which perform the required data manipulation without writing
back the intermediate results, would achieve the desired
optimization and reduction of bandwidth. In practice, this
chain of filters should be constructed in a logical way and
configured during runtime. Existing approaches to extend
instruction sets use static modules, which are not modifiable
during runtime. Customized microprocessors or ASICs are
optimized for one special application environment. It is
nearly impossible to use the same core for another
application without losing the performance gain of this
architecture. A new approach to a flexible and
high-performance datapath concept is needed, which allows the
functionality to be reconfigured and makes the core mainly
application independent without losing the performance needed
for stream-based applications.
This contribution introduces a new concept of a loosely
coupled implementation of the dynamically reconfigurable XPP
architecture from PACT Corp. into a static datapath of the
SPARC compatible LEON processor. Thus, this approach differs
from those in which the XPP operates as a completely separate
component within one Configurable System-on-Chip (CSoC),
together with a processor core, global/local memory
topologies and efficient multi-layer AMBA-bus interfaces.
Here, from the programmer's point of view, the extended and
adapted datapath looks like a dynamically configurable
instruction set. It can be customized for a specific
application and accelerate its execution enormously.
Therefore, the programmer has to create a number of
configurations, which can be uploaded to the XPP array at run
time; for example, a configuration can be used like a filter
to process stream-oriented data. It is also possible to
configure more than one function at the same time and use
them simultaneously. This concept promises an enormous
performance boost together with the flexibility and power
reduction needed to perform a series of applications very
effectively.



1. LEON RISC Microprocessor 

For implementation of this concept we chose the 32-bit SPARC
V8 compatible microprocessor LEON [1], [2]. This
microprocessor is a synthesizable, freely available VHDL
model which has a load/store architecture and a five-stage
pipeline implementation with separate instruction and data
caches.
Figure 1: LEON Architecture Overview



As shown in Figure 1, the LEON is provided with a full
implementation of the AMBA 2.0 AHB and APB on-chip buses, a
hardware multiplier and divider, a programmable 8/16/32-bit
memory controller for external PROM, static RAM and SDRAM,
and several on-chip peripherals such as timers, UARTs, an
interrupt controller and a 16-bit I/O port. A simple
power-down mode is implemented as well.
LEON was developed by the European Space Agency (ESA) for
future space missions. The performance of LEON is close to
that of the ARM9 series, but it does not have a memory
management unit (MMU) implementation, which limits its use to
single memory space applications. Figure 2 shows the datapath
of the LEON integer unit.
Figure 2: LEON Pipelined Datapath Structure

2. eXtreme Processing Platform - XPP

The XPP architecture [6], [7], [8] is based on a 
hierarchical array of coarse-grain, adaptive 
computing elements called Processing Array 
Elements (PAEs) and a packet-oriented 
communication network. The strength of the XPP 
technology originates from the combination of array 
processing with unique, powerful run-time 
reconfiguration mechanisms. Since configuration 
control is distributed over a Configuration Manager 
(CM) embedded in the array, PAEs can be 
configured rapidly in parallel while neighboring 
PAEs are processing data. Entire applications can be 
configured and run independently on different parts 
of the array. Reconfiguration is triggered externally 
or even by special event signals originating within 
the array, enabling self-reconfiguring designs. By 
utilizing protocols implemented in hardware, data 
and event packets are used to process, generate, 
decompose and merge streams of data. 
The XPP has some similarities with other coarse-grain
reconfigurable architectures like the KressArray [3] or Raw
Machines [4], which are specifically designed for
stream-based applications. XPP's main distinguishing features
are its automatic packet-handling mechanisms and its
sophisticated hierarchical configuration protocols for
runtime and self-reconfiguration.

2.1 Array Structure

A CM consists of a state machine and internal RAM for
configuration caching. The PAC (Processing Array Cluster)
itself (see top right-hand side of Figure 3) contains a
configuration bus which connects the CM with PAEs and other
configurable objects. Horizontal busses carry data and
events. They can be segmented by configurable switch-objects,
and connected to PAEs and special I/O objects at the
periphery of the device.
A PAE is a collection of PAE objects. The typical 
PAE shown in Figure 3 (bottom) contains a BREG 
object (back registers) and an FREG object (forward 
registers) which are used for vertical routing, as well 
as an ALU object which performs the actual 
computations. The ALU performs common fixed-point
arithmetical and logical operations as well as several
special three-input opcodes like multiply-add, sort, and
counters. Events generated by ALU objects depend on ALU
results or exceptions, very similar to the state flags of a
classical microprocessor. A counter, e.g., generates a
special event only after it has terminated. The next section
explains how these events are used. Another PAE object
implemented in the XPP is a memory object which can be used
in FIFO mode or as RAM for lookup tables, intermediate
results etc. However, any PAE object functionality can be
included in the XPP architecture.

2.2 Packet Handling and Synchronization 

PAE objects as defined above communicate via a
packet-oriented network. Two types of packets are sent
through the array: data packets and event packets. Data
packets have a uniform bit width specific to the device type.
In normal operation mode, PAE objects are
self-synchronizing. An
operation is performed as soon as all necessary data 
input packets are available. The results are 
forwarded as soon as they are available, provided 
the previous results have been consumed. Thus it is 
possible to map a signal-flow graph directly to ALU 
objects. Event packets are one bit wide. They 
transmit state information which controls ALU 
execution and packet generation. 
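
The firing rule described above can be illustrated with a minimal Python sketch. It is not PACT's implementation: the class name `AluObject`, its methods, and the single-slot output register are assumptions made purely for illustration. The point is that an object computes only when every input holds a packet and its previous result has been consumed.

```python
from collections import deque


class AluObject:
    """Hypothetical model of a self-synchronizing ALU object."""

    def __init__(self, op, n_inputs=2):
        self.op = op                                   # function applied to the input packets
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = None                             # single result slot (assumption)

    def push(self, port, value):
        self.inputs[port].append(value)

    def can_fire(self):
        # Fire only if all inputs hold a packet and the last result was consumed.
        return self.output is None and all(self.inputs)

    def step(self):
        if not self.can_fire():
            return False
        args = [queue.popleft() for queue in self.inputs]
        self.output = self.op(*args)
        return True

    def take(self):
        value, self.output = self.output, None
        return value


# Map the signal-flow graph y = (a + b) * c onto two ALU objects.
add = AluObject(lambda x, y: x + y)
mul = AluObject(lambda x, y: x * y)

add.push(0, 2)
fired_early = add.step()       # second operand is missing, so nothing happens
add.push(1, 3)
add.step()
mul.push(0, add.take())
mul.push(1, 4)
mul.step()
result = mul.take()
```

No central scheduler appears in the sketch; the availability of packets alone decides when each object runs, which is what allows a signal-flow graph to be mapped directly.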

2.3 Configuration

Every PAE locally stores its current configuration state,
i.e. whether it is part of a configuration or not (states
"configured" or "free"). Once a PAE is configured, it changes
its state to "configured". This prevents the CM from
reconfiguring a PAE which is still used by another
application. The CM caches the configuration data in its
internal RAM until the required PAEs become available.




Figure 3: Structure of an XPP device 



While loading a configuration, all PAEs start to compute
their part of the application as soon as they are in state
"configured". Partially configured applications are able to
process data without loss of packets. This concurrency of
configuration and computation hides configuration latency.
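
The state handling described in this section can be sketched as follows. This is a simplified model under stated assumptions, not the actual CM logic: the names `Pae`, `ConfigurationManager`, `request` and `release` are hypothetical, and the explicit `release` call stands in for whatever mechanism returns a PAE to the "free" state.

```python
class Pae:
    """Hypothetical PAE that only tracks its configuration state."""

    def __init__(self):
        self.state = "free"
        self.owner = None


class ConfigurationManager:
    """Sketch of a CM that caches configurations until their PAEs are free."""

    def __init__(self, paes):
        self.paes = paes
        self.pending = []                  # cached (name, PAE indices still to configure)

    def request(self, name, pae_indices):
        self.pending.append((name, list(pae_indices)))
        self.service()

    def service(self):
        still_waiting = []
        for name, indices in self.pending:
            remaining = []
            for i in indices:
                pae = self.paes[i]
                if pae.state == "free":
                    pae.state = "configured"   # this PAE may start computing now
                    pae.owner = name
                else:
                    remaining.append(i)        # still used by another application
            if remaining:
                still_waiting.append((name, remaining))
        self.pending = still_waiting

    def release(self, name):
        # Assumption: an application frees its PAEs explicitly when it is done.
        for pae in self.paes:
            if pae.owner == name:
                pae.state, pae.owner = "free", None
        self.service()


paes = [Pae() for _ in range(4)]
cm = ConfigurationManager(paes)
cm.request("fir", [0, 1])
cm.request("idct", [1, 2])                 # PAE 1 is busy, PAE 2 is configured at once
owners_before = [p.owner for p in paes]
cm.release("fir")
owners_after = [p.owner for p in paes]
```

In the usage lines, the second configuration is only partially loaded at first: its free PAE is configured immediately, while the remainder stays cached until the first application releases the shared PAE.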

2.4 XPP Application Mapping 

The Native Mapping Language (NML), a PACT proprietary
structural language with reconfiguration primitives, was
developed by PACT to map applications to the XPP array. It
gives the programmer direct access to all hardware features.
In NML, configurations consist of modules which are specified
as in a structural hardware description language, similar to,
for instance, structural VHDL. PAE objects are explicitly
allocated, optionally
placed, and their connections specified. Hierarchical 
modules allow component reuse, especially for 
repetitive layouts. Additionally, NML includes 
statements to support configuration handling. A 
complete NML application program consists of one 
or more modules, a sequence of initially configured 
modules, differential changes, and statements which 
map event signals to configuration and prefetch 
requests. Thus configuration handling is an explicit 
part of the application program. 
A complete XPP Development Suite (XDS) is available from
PACT. For more details on XPP-based architectures and
development tools see [6].

3. LEON Instruction Datapath Extension

The system is designed to offer a maximum of performance.
LEON and XPP should be able to communicate with each other
in a simple and high-performance manner. While the XPP is a
dataflow-oriented device, the LEON is a general-purpose
processor, suitable for handling control flow [1], [2].
Therefore, LEON is used for system control. To do this, the
XPP is integrated into the datapath of the LEON integer
unit, which is able to control the XPP.



Figure 4: Extended Datapath Overview



Due to the unpredictable operation time of the XPP algorithm,
integration of the XPP into the LEON datapath is done in a
loosely coupled way (Figure 4). Thus the XPP array can
operate independently of the LEON, which is able to control
and reconfigure the XPP during runtime. Since the
configuration of the XPP is handled by LEON, the CM of the
XPP is unnecessary and can be left out of the XPP array.
The configuration codes are stored in the LEON RAM. LEON
transfers the needed configuration from its system RAM into
the XPP and creates the needed algorithm on the array.
To enable a maximum of independence of XPP from LEON, all
ports of the XPP - input ports as well as output ports - are
buffered using dual-clock FIFOs, which are implemented in the
IO-Ports between LEON and XPP. To transmit data to the
extended XPP-based datapath, the data are passed through an
IO-Port as shown in Figure 5. In addition to the FIFO, the
IO-Ports contain logic to generate handshake signals and an
interrupt request signal. The IO-Port for receiving data from
XPP is similar to Figure 5 except for the reversed direction
of the data signals. This enables the XPP to work completely
independently of LEON as long as there are input data
available in the input port FIFOs and free space for result
data in the output port FIFOs. There are a number of
additional features implemented in the LEON pipeline to
control the data transfer between LEON and XPP.



Figure 5: LEON-to-XPP dual-clock FIFO

When LEON tries to write to an IO-Port containing a full FIFO
or to read from an IO-Port containing an empty FIFO, a trap
is generated. This trap can be handled through a trap
handler. There is a further mechanism, pipeline-holding,
which allows LEON to hold the pipeline and wait for free FIFO
space during an XPP write access or for a valid FIFO value
during an XPP read access. When using pipeline-holding, the
software developer has to avoid reading from an IO-Port with
an empty FIFO while the XPP, including its input IO-Ports,
contains no data from which to produce outputs. In this case
a deadlock will occur and the complete system has to be
reset.
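
The difference between trap handling and pipeline-holding can be sketched with a small software model. It is an illustration under assumptions, not the hardware design: `IoPort`, `FifoTrap` and `would_deadlock` are hypothetical names, a held pipeline is modelled by a return value that tells the caller to retry, and the deadlock check ignores data still in flight inside the array.

```python
from collections import deque


class FifoTrap(Exception):
    """Stands in for the processor trap raised on an illegal FIFO access."""


class IoPort:
    """Hypothetical model of one IO-Port FIFO between LEON and XPP."""

    def __init__(self, depth, hold_mode=False):
        self.depth = depth
        self.hold_mode = hold_mode
        self.fifo = deque()

    def write(self, value):
        if len(self.fifo) >= self.depth:
            if self.hold_mode:
                return False               # pipeline is held; the access must be retried
            raise FifoTrap("write to full FIFO")
        self.fifo.append(value)
        return True

    def read(self):
        if not self.fifo:
            if self.hold_mode:
                return (False, None)       # pipeline is held; the access must be retried
            raise FifoTrap("read from empty FIFO")
        return (True, self.fifo.popleft())


def would_deadlock(xpp_input, xpp_output):
    # A held read can never complete if nothing is left for the XPP to compute on.
    return xpp_output.hold_mode and not xpp_output.fifo and not xpp_input.fifo


trap_port = IoPort(depth=2)
trap_port.write(1)
trap_port.write(2)
try:
    trap_port.write(3)
    trapped = False
except FifoTrap:
    trapped = True

hold_port = IoPort(depth=1, hold_mode=True)
held_read = hold_port.read()               # empty FIFO: the read is held, not trapped
```

In trap mode the third write raises immediately, whereas in hold mode the access simply does not complete, which is why software has to rule out the deadlock case itself.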

XPP can generate interrupts for the LEON when trying to read
a value from an empty FIFO port or to write a value to a full
FIFO port. The occurrence of interrupts indicates that the
XPP array cannot process the next step because it either has
no input values or cannot output the result value. The
interrupts generated by the XPP are maskable.
The interface provides information about the FIFOs. LEON can
read the number of valid values the FIFO contains.
The interface to the XPP appears to the LEON as a set of
special registers (Figure 6). These XPP registers can be
categorized into communication registers and status
registers.

Figure 6: Extended LEON Instruction Pipeline

For data exchange the XPP communication registers are used.
Since XPP provides three different types of communication
ports, there are also three types of communication registers,
each of which is split into an input part and an output part:
- The data for the process are accessed through XPP data
registers. The number of data input and data output ports as
well as the data bit width depends on the implemented XPP
array.
- XPP can generate and consume events. Events are one-bit
signals. The number of input events and output events again
depends on the implemented XPP array.
- Configuration of the XPP is done through the XPP
configuration register. LEON reads the required configuration
value from a file stored in its system RAM and writes it to
the XPP configuration register.

There are a number of XPP status registers implemented to
control the behavior and get status information of the
interface. Switching between the usage of trap handling and
pipeline holding can be done in the hold register. An XPP
clock register contains the clock frequency ratio between
LEON and XPP. By writing this register, LEON software can set
the XPP clock relative to the LEON clock. This allows the XPP
clock frequency to be adapted to the required XPP performance
and consequently to influence the power consumption of the
system. Writing zero to the XPP clock register turns off the
XPP. Finally, there is a status register for every FIFO
containing the number of valid values currently available in
the FIFO.
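
As a small worked illustration of the clock register, the sketch below assumes that the register holds an integer divider of the LEON clock. The source only states that the register contains a frequency ratio, so this encoding and the function name `xpp_clock_hz` are assumptions.

```python
def xpp_clock_hz(leon_clock_hz, clock_register):
    """Return the XPP clock for a register value (assumed integer divider)."""
    if clock_register == 0:
        return 0.0                         # writing zero turns the XPP off
    return leon_clock_hz / clock_register


# A 250 MHz LEON clock with a divider of 5 yields a 50 MHz XPP clock.
xpp_freq = xpp_clock_hz(250e6, 5)
xpp_off = xpp_clock_hz(250e6, 0)
```

Under this assumed encoding, a divider of 5 reproduces the 250 MHz and 50 MHz clock pair used for the simulations in Section 5.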

These status registers provide a maximum of flexibility in
communication between LEON and XPP and enable different
communication modes:
- If there is only one application running on the system at a
time, software may be developed in pipeline-hold mode. Here
LEON initiates a data read from or write to the XPP. If there
is no value to read or no space to write, the LEON pipeline
is stopped until the read or write is possible. This can be
used to reduce the power consumption of the LEON part.
- In interrupt mode, XPP can influence the LEON program flow.
The IO-Ports generate an interrupt depending on the actual
number of values available in the FIFO. The communication
between LEON and XPP is done in interrupt service routines.
- Polling mode is the classical way to access the XPP.
Initiated by a timer event, LEON reads all XPP ports
containing data and writes all XPP ports containing free FIFO
space. Between these phases LEON can perform other
calculations.
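
Polling mode, the last of the modes listed above, can be sketched as a loop that checks the FIFO fill levels before every access, so that neither a trap nor a pipeline hold is ever triggered. The helper names `poll_once` and `xpp_step` are hypothetical, the FIFO depth of 2 is arbitrary, and the toy array that doubles each value merely stands in for a configured XPP algorithm.

```python
from collections import deque


def poll_once(to_xpp, from_xpp, depth, pending_inputs, results):
    """One polling phase: drain ports holding data, then fill free input space."""
    while from_xpp:                                  # fill level read from a status register
        results.append(from_xpp.popleft())
    while pending_inputs and len(to_xpp) < depth:    # only write while space is free
        to_xpp.append(pending_inputs.popleft())


def xpp_step(to_xpp, from_xpp, depth):
    """Toy stand-in for the array: doubles one value per call if there is room."""
    if to_xpp and len(from_xpp) < depth:
        from_xpp.append(2 * to_xpp.popleft())


to_xpp, from_xpp = deque(), deque()
pending = deque(range(5))
results = []
while len(results) < 5:
    poll_once(to_xpp, from_xpp, 2, pending, results)   # triggered by a timer event
    for _ in range(3):                                 # XPP runs while LEON does other work
        xpp_step(to_xpp, from_xpp, 2)
```

The same loop body could be moved into an interrupt service routine to obtain interrupt mode; only the trigger changes, not the FIFO accesses.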

It is possible at any time to switch between these
strategies within one application.
The XPP is delivered containing a configuration manager to
handle configuration and reconfiguration of the array. In
this concept the configuration manager is dispensable because
the configuration as well as any reconfiguration is
controlled by the LEON through the XPP configuration
register. All XPP configurations used for an application are
stored in the LEON's system RAM.



4. Tool and Compiler Integration

The LEON's SPARC V8 instruction set [1] was extended by a new
subset of instructions to make the new XPP registers
accessible through software. These instructions are based on
the SPARC instruction format, but they do not conform to the
SPARC V8 standard. Corresponding to the SPARC conventions of
a load/store architecture, the instruction subset can be
split into two general types. Load/store instructions can
exchange data between the LEON memory and the XPP
communication registers. The number of cycles per instruction
is similar to that of the standard load/store instructions of
the LEON. Read/write instructions are used for communication
between LEON registers. Since the LEON register set is
extended by the XPP registers, the read/write instructions
are also extended to access XPP registers. Status registers
can only be accessed with read/write instructions. Execution
of arithmetic instructions on XPP registers is not possible.
Values have to be written to standard LEON registers before
they can be the target of arithmetic operations. The complete
system can still execute any SPARC V8 compatible code. In
that case, the XPP is completely unused.

The LEON is provided with the LECCS cross compiler system
[9], distributed under the terms of the LGPL. This system
consists of modified versions of binutils 2.11 and gcc
2.95.2. To make the new instruction subset available to
software developers, the assembler of the binutils has been
extended by a number of instructions according to the
implemented instruction subset. The new instructions have the
same mnemonics as the regular SPARC V8 load, store, read and
write instructions. Only the new XPP registers have to be
used as source or target operand. Since the modifications of
LECCS are straightforward extensions, the cross compiler
system is backward compatible with the original version. The
availability of the source code of LECCS made it possible to
extend the tools by the new XPP operations in the described
way.

The development of the XPP algorithms has to be done with
separate tools, provided by PACT Corp.



5. Application Results 

As a first analysis application, an inverse DCT applied to an
8x8 pixel block was implemented. For all simulations we used
a 250 MHz clock frequency for the LEON processor and a 50 MHz
clock frequency for the XPP. The usage of XPP accelerates the
computation of the IDCT by about a factor of four, depending
on the communication mode. However, XPP has to be configured
before computing the IDCT on it. Table 1 also shows the
configuration time for this algorithm.

                      LEON alone     LEON with XPP   LEON with XPP   LEON with XPP
                                     in IRQ Mode     in Poll Mode    in Hold Mode

Configuration         -              71,308 ns       84,364 ns       77,976 ns
of XPP                               17,827 cycles   21,091 cycles   19,494 cycles

2D IDCT (8x8)         14,672 ns      3,272 ns        3,872 ns        3,568 ns
                      3,668 cycles   818 cycles      968 cycles      892 cycles

Table 1: Performance on IDCT (8x8)

Figure 7: Computation Time of IDCT (8x8)

As shown in Figure 7, the benefit brought by XPP rises with
the number of IDCT blocks computed by it before
reconfiguration, so the number of reconfigurations during
complex algorithms should be minimised. In IRQ mode, for
example, each block saves 3,668 - 818 = 2,850 cycles, so the
configuration cost of 17,827 cycles is recovered after seven
blocks.
A first complex application implemented on the system is
MPEG-4 decoding. The optimization of the algorithm
partitioning between LEON and XPP is still in progress.
Figure 8 shows the block diagram of the MPEG-4 decoding
algorithm. Frames of 320 x 240 pixels were decoded. Using
SPARC V8 standard instructions, LEON decodes one frame in
[illegible].46 seconds. In a first implementation of MPEG-4
using the XPP, only the IDCT is computed by XPP; the rest of
the MPEG-4 decoding is still done with LEON. Now, with the
help of XPP, one frame is decoded in 17.98 s. This is a
performance boost of more than twenty percent. Since the
performance gain from accelerating only the IDCT algorithm is
very low at the moment, we are working on XPP implementations
of Huffman decoding, dequantisation and prediction decoding.
So the performance boost of this concept against the
standalone LEON will be increased.

6. Conclusion



Today, the instruction datapaths of modern microprocessors
reach their limits by using static instruction sets, driven
by the traditional von Neumann or Harvard architectural
principles. A way out of these limitations is a dynamically
reconfigurable processor datapath extension achieved by
integrating traditional static datapaths with the
coarse-grain dynamically reconfigurable XPP architecture
(eXtreme Processing Platform). Therefore, a loosely
asynchronous coupling mechanism of the given instruction
datapath has been developed and integrated onto a CMOS
0.13 µm standard cell technology from UMC. Here the SPARC
compatible LEON RISC processor is used, whereas its static
pipelined instruction datapath has been extended to be
configured and personalized for specific applications. This
compiler-compatible instruction set extension allows a varied
and efficient use, e.g. in streaming application domains like
MPEG-4, digital filters, mobile communication modulation,
etc. The introduced coupling technique by flexible dual-clock
FIFO interfaces allows asynchronous concurrency and adapting
the frequency of the configured XPP datapath to the actual
performance requirements, e.g. for avoiding unneeded cycles
and reducing power consumption.

As presented above, the introduced concept combines the
flexibility of a general-purpose microprocessor with the
performance and power consumption of coarse-grain
reconfigurable datapath structures, nearly comparable to ASIC
performance. Here, two programming and computing paradigms
(control-driven von Neumann and transport-triggered XPP) are
unified within one hybrid architecture with the option of two
clock domains. The ability to reconfigure the
transport-triggered XPP makes the system independent of
standards or specific applications. This concept opens up the
potential to develop multi-standard communication devices
like software radios by using one extended processor
architecture with adapted programming and compilation tools.
Thus new standards can be easily implemented through software
updates. The system is scalable at design time through the
scalable array structure of the XPP extension used. This
extends the range of suitable applications from products with
few multimedia functions to complex high-performance systems.
Claims

1. Method of simultaneously operating a sequential processor
and a reconfigurable array wherein data are transferred into
said reconfigurable array from a data cache to said array and
wherein results produced in said array from said data are
written to a destination.

2. Method according to claim 1, wherein said destination is
placed upstream of the arithmetic unit of said sequential
processor.

3. Method according to the previous claim, wherein the data
output from said reconfigurable array is, at least in part,
fed into the data path of said processor unit downstream of
the decoding circuitry of said processing unit.

4. Method according to any of the previous claims, wherein
the arithmetic logic unit of said processor is adapted to
perform at least one operation on said data outputted from
said reconfigurable array.

5. Method according to any of the previous claims, wherein
the arithmetic-logic-circuitry comprises circuitry for
multiplication and/or division and/or in particular said
operation performed on said data outputted from said
reconfigurable array comprises a multiplication and/or
division and/or norming.

6. Method according to any of the previous claims, wherein
said data outputted from said reconfigurable array is,
preferably selectably, writable to a memory location other
than said cache and/or the register of said sequential
processing unit.

7. Method according to any of the previous claims, wherein
said destination is downstream of the arithmetic logic unit
and/or upstream of the cache coupled to said processing unit.

File: PACT45/EP



